Recap Lab 5

Leon Eyrich Jessen

A Few Meta Points…

Variable Assignment

We noticed some conceptual challenges here, so recap:

  • In R, we state values directly in the chunk or the console, e.g.:
3
[1] 3
  • Here, we just state 3, so R simply “throws” that right back at you!

  • Now, if we want to “catch” that 3, we have to assign it to a variable, e.g.:

x <- 3
  • Notice how now we “catch” the 3 and nothing is “thrown” back to you, because we now have the 3 stored in x:
x
[1] 3

  • Now, we can of course use x moving forward, e.g. by adding 2:
x + 2
[1] 5
  • Notice how this does not change x and the result is simply “thrown” right-back-at-ya
x
[1] 3
  • If we wanted to update x by adding 2, we would have to “catch” the result as before:
x <- x + 2
  • Now, we have updated x:
x
[1] 5

Load Libraries

library("tidyverse")

Remember to load libraries first! You cannot use tools from a toolbox you have not yet “picked up”.

Load Data

  • Create the data directory programmatically
fs::dir_create(path = "data") # dir_create() is from the fs package, which library("tidyverse") does not attach
  • Retrieve the data directly
diabetes_data <- read_csv(file = "https://hbiostat.org/data/repo/diabetes.csv")
  • Write the data to disk
write_csv(x = diabetes_data,
          file = "data/diabetes.csv")
  • Read the data back in
diabetes_data <- read_csv(file = "data/diabetes.csv")
  • Remember to set the eval=FALSE flag in the chunk settings of the retrieve-and-write steps, to avoid downloading the data every time you hit render

Look at data

diabetes_data
# A tibble: 403 × 19
      id  chol stab.glu   hdl ratio glyhb location     age gender height weight
   <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <chr>      <dbl> <chr>   <dbl>  <dbl>
 1  1000   203       82    56  3.60  4.31 Buckingham    46 female     62    121
 2  1001   165       97    24  6.90  4.44 Buckingham    29 female     64    218
 3  1002   228       92    37  6.20  4.64 Buckingham    58 female     61    256
 4  1003    78       93    12  6.5   4.63 Buckingham    67 male       67    119
 5  1005   249       90    28  8.90  7.72 Buckingham    64 male       68    183
 6  1008   248       94    69  3.60  4.81 Buckingham    34 male       71    190
 7  1011   195       92    41  4.80  4.84 Buckingham    30 male       69    191
 8  1015   227       75    44  5.20  3.94 Buckingham    37 male       59    170
 9  1016   177       87    49  3.60  4.84 Buckingham    45 male       69    166
10  1022   263       89    40  6.60  5.78 Buckingham    55 female     63    202
# ℹ 393 more rows
# ℹ 8 more variables: frame <chr>, bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>,
#   bp.2d <dbl>, waist <dbl>, hip <dbl>, time.ppn <dbl>

Note the column types are stated, i.e. what kind of data is in each column

  • <chr>: character
  • <dbl>: double

mutate()

  • Change the height, weight, waist and hip from inches/pounds to the metric system (cm/kg), rounding to 1 decimal
diabetes_data <- diabetes_data |>
  mutate(height_cm = round(height * 2.54,
                           digits = 1),
         weight_kg = round(weight * 0.454,
                           digits = 1),
         hip_cm = round(hip * 2.54,
                        digits = 1),
         waist_cm = round(waist * 2.54,
                          digits = 1))
diabetes_data |> 
  select(matches("height|weight|hip|waist")) # Note: selecting columns using a regular expression
# A tibble: 403 × 8
   height weight waist   hip height_cm weight_kg hip_cm waist_cm
    <dbl>  <dbl> <dbl> <dbl>     <dbl>     <dbl>  <dbl>    <dbl>
 1     62    121    29    38      158.      54.9   96.5     73.7
 2     64    218    46    48      163.      99    122.     117. 
 3     61    256    49    57      155.     116.   145.     124. 
 4     67    119    33    38      170.      54     96.5     83.8
 5     68    183    44    41      173.      83.1  104.     112. 
 6     71    190    36    42      180.      86.3  107.      91.4
 7     69    191    46    49      175.      86.7  124.     117. 
 8     59    170    34    39      150.      77.2   99.1     86.4
 9     69    166    34    40      175.      75.4  102.      86.4
10     63    202    45    50      160       91.7  127      114. 
# ℹ 393 more rows

mutate() OR

inch_to_cm_fct <- 2.54
my_digits <- 1
diabetes_data <- diabetes_data |>
  mutate(height_cm = round(height * inch_to_cm_fct,
                           digits = my_digits),
         weight_kg = round(weight * 0.454,
                           digits = my_digits),
         hip_cm = round(hip * inch_to_cm_fct,
                        digits = my_digits),
         waist_cm = round(waist * inch_to_cm_fct,
                          digits = my_digits))
diabetes_data |> 
  select(matches("height|weight|hip|waist")) # Note: selecting columns using a regular expression
# A tibble: 403 × 8
   height weight waist   hip height_cm weight_kg hip_cm waist_cm
    <dbl>  <dbl> <dbl> <dbl>     <dbl>     <dbl>  <dbl>    <dbl>
 1     62    121    29    38      158.      54.9   96.5     73.7
 2     64    218    46    48      163.      99    122.     117. 
 3     61    256    49    57      155.     116.   145.     124. 
 4     67    119    33    38      170.      54     96.5     83.8
 5     68    183    44    41      173.      83.1  104.     112. 
 6     71    190    36    42      180.      86.3  107.      91.4
 7     69    191    46    49      175.      86.7  124.     117. 
 8     59    170    34    39      150.      77.2   99.1     86.4
 9     69    166    34    40      175.      75.4  102.      86.4
10     63    202    45    50      160       91.7  127      114. 
# ℹ 393 more rows

Now, if we wanted 2 decimals, we only have to change one variable! 👍

filter()

  • How many men in Buckingham are younger than 30 and taller than 1.9m?
diabetes_data |>
  filter(location == "Buckingham",
         gender == "male", # The question asks about men specifically
         age < 30,
         height_cm > 190)
# A tibble: 1 × 23
     id  chol stab.glu   hdl ratio glyhb location     age gender height weight
  <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <chr>      <dbl> <chr>   <dbl>  <dbl>
1 10000   185       76    58  3.20  4.83 Buckingham    23 male       76    164
# ℹ 12 more variables: frame <chr>, bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>,
#   bp.2d <dbl>, waist <dbl>, hip <dbl>, time.ppn <dbl>, height_cm <dbl>,
#   weight_kg <dbl>, hip_cm <dbl>, waist_cm <dbl>

filter()

  • Make a scatter plot of weight versus height and colour by gender for inhabitants of Louisa above the age of 40
diabetes_data |>
  filter(location == "Louisa",
         age > 40) |> 
  ggplot(aes(x = height_cm, y = weight_kg, colour = gender)) +
  geom_point()

filter()

  • Alternative: keep all subjects in the scatter plot of weight versus height, but colour by gender only for inhabitants of Louisa above the age of 40
diabetes_data |>
  mutate(subject = case_when(location == "Louisa" & age > 40 ~ gender,
                             TRUE ~ "Not incl")) |> # Create needed attribute BEFORE plotting
  ggplot(aes(x = height_cm, y = weight_kg, colour = subject)) +
  geom_point(alpha = 0.5)

arrange()

  • How old is the youngest person?
# A tibble: 1 × 23
     id  chol stab.glu   hdl ratio glyhb location   age gender height weight
  <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <chr>    <dbl> <chr>   <dbl>  <dbl>
1  4823   193       77    49  3.90  4.31 Louisa      19 female     61    119
# ℹ 12 more variables: frame <chr>, bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>,
#   bp.2d <dbl>, waist <dbl>, hip <dbl>, time.ppn <dbl>, height_cm <dbl>,
#   weight_kg <dbl>, hip_cm <dbl>, waist_cm <dbl>
  • How old is the oldest person?
# A tibble: 1 × 23
     id  chol stab.glu   hdl ratio glyhb location     age gender height weight
  <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <chr>      <dbl> <chr>   <dbl>  <dbl>
1  2770   165       94    69  2.40  4.98 Buckingham    92 female     62    217
# ℹ 12 more variables: frame <chr>, bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>,
#   bp.2d <dbl>, waist <dbl>, hip <dbl>, time.ppn <dbl>, height_cm <dbl>,
#   weight_kg <dbl>, hip_cm <dbl>, waist_cm <dbl>
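The code behind these two outputs is not shown. A minimal sketch of one way to get them, assuming the intended approach is arrange() followed by slice(). The names demo, youngest and oldest are hypothetical, and demo holds only the id and age of the first five rows printed earlier:

```r
library("tidyverse")

# Stand-in data: id and age of the first five rows printed earlier
demo <- tibble(id  = c(1000, 1001, 1002, 1003, 1005),
               age = c(46, 29, 58, 67, 64))

# Youngest: sort ascending by age and keep the first row
youngest <- demo |>
  arrange(age) |>
  slice(1)

# Oldest: sort descending by age and keep the first row
oldest <- demo |>
  arrange(desc(age)) |>
  slice(1)
```

On the full data, the same two pipelines would start from diabetes_data instead of demo.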

mutate()

  • Create a new variable, where you calculate the BMI
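The solution code is not shown here. A minimal sketch, assuming the standard definition BMI = weight in kg divided by the square of the height in m. The tibble demo is a hypothetical stand-in built from the height and weight of the first three rows printed earlier:

```r
library("tidyverse")

# Stand-in data: height (inches) and weight (pounds) of the first three rows
demo <- tibble(height = c(62, 64, 61),
               weight = c(121, 218, 256))

demo <- demo |>
  mutate(height_cm = round(height * 2.54, digits = 1),
         weight_kg = round(weight * 0.454, digits = 1),
         # height_cm is in cm, so divide by 100 to get metres
         BMI = weight_kg / (height_cm / 100)^2)
```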

case_when()

  • Create a BMI_class variable
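The solution code is not shown here. A sketch of how case_when() could do this, using the class labels from the summary table later in this recap. The cut-offs below are an assumption (common BMI conventions), not taken from the slides, and the BMI values in demo are made up to hit each class once:

```r
library("tidyverse")

# Made-up BMI values, one per class, plus a missing value
demo <- tibble(BMI = c(15, 17, 22, 27, 32, 37, 42, NA))

demo <- demo |>
  mutate(BMI_class = case_when(
    BMI < 16.5 ~ "Severely underweight", # Assumed cut-off
    BMI < 18.5 ~ "Underweight",
    BMI < 25   ~ "Normal weight",
    BMI < 30   ~ "Overweight",
    BMI < 35   ~ "Obesity class I",
    BMI < 40   ~ "Obesity class II",
    BMI >= 40  ~ "Obesity class III"))
```

case_when() uses the first condition that is TRUE, so each line only needs an upper bound. A missing BMI matches no condition and therefore stays NA.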

Factor levels

diabetes_data |>
  count(BMI_class) |>
  drop_na(BMI_class) |> 
  mutate(pct = n / sum(n) * 100) |>
  ggplot(aes(x = BMI_class, y = pct)) +
  geom_col()
  • The column order is nonsensical: a character variable is plotted in alphabetical order
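The step that fixes the order is not shown. A minimal sketch, assuming it is done by converting BMI_class to a factor with explicit levels. The names demo and bmi_levels are hypothetical, and demo is a made-up stand-in with one value per class:

```r
library("tidyverse")

# Made-up stand-in for the BMI_class column, deliberately out of order
demo <- tibble(BMI_class = c("Overweight", "Normal weight", "Underweight",
                             "Obesity class I", "Severely underweight",
                             "Obesity class III", "Obesity class II"))

# Class labels in increasing order of BMI
bmi_levels <- c("Severely underweight", "Underweight", "Normal weight",
                "Overweight", "Obesity class I", "Obesity class II",
                "Obesity class III")

demo <- demo |>
  mutate(BMI_class = factor(BMI_class, levels = bmi_levels))

# ggplot orders a factor on the x-axis by its levels, not alphabetically
levels(demo$BMI_class)
```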

Factor levels

diabetes_data |>
  count(BMI_class) |>
  drop_na(BMI_class) |> 
  mutate(pct = n / sum(n) * 100) |>
  ggplot(aes(x = BMI_class, y = pct)) +
  geom_col()
  • The column order is now sensible

group_by() |> summarise()

  • For each BMI_class group, calculate the average weight and associated standard deviation
# A tibble: 8 × 4
  BMI_class                n mu_weight_kg sigma_weight_kg
  <fct>                <int>        <dbl>           <dbl>
1 Severely underweight     2         49.5            5.80
2 Underweight              7         52.2            6.43
3 Normal weight          112         65.9            9.77
4 Overweight             124         76.2            8.86
5 Obesity class I         90         89.2            9.97
6 Obesity class II        32        102.            13.5 
7 Obesity class III       30        115.            15.8 
8 <NA>                     6         NA             NA   
  • <fct> = a factor variable, i.e. a category!
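The code behind this table is not shown. A minimal sketch that produces the same columns (n, mu_weight_kg, sigma_weight_kg), assuming plain mean() and sd() without na.rm. The tibble demo and its weights are made up for illustration:

```r
library("tidyverse")

# Made-up weights for two of the classes
demo <- tibble(BMI_class = c("Normal weight", "Normal weight",
                             "Overweight", "Overweight"),
               weight_kg = c(60, 70, 80, 90))

bmi_summary <- demo |>
  group_by(BMI_class) |>
  summarise(n = n(),                         # Group size
            mu_weight_kg = mean(weight_kg),  # Average weight per group
            sigma_weight_kg = sd(weight_kg)) # Standard deviation per group
```

Without na.rm = TRUE, a single missing weight in a group makes both summaries NA for that group.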

Assignment

Data augmentation

  • Create a BFP (Body fat percentage) variable
  • Create a WHR (waist-to-hip ratio) variable
# A tibble: 3 × 28
     id  chol stab.glu   hdl ratio glyhb location     age gender height weight
  <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <chr>      <dbl> <chr>   <dbl>  <dbl>
1 40751   239       85    63  3.80  5.16 Louisa        39 male       60    144
2 13250   181      255    26  7     9.58 Buckingham    50 male       71    320
3 15515   229       91    43  5.30  4.73 Louisa        23 male       72    180
# ℹ 17 more variables: frame <chr>, bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>,
#   bp.2d <dbl>, waist <dbl>, hip <dbl>, time.ppn <dbl>, height_cm <dbl>,
#   weight_kg <dbl>, hip_cm <dbl>, waist_cm <dbl>, BMI <dbl>, BMI_class <chr>,
#   gender_class <dbl>, BFP <dbl>, WHR <dbl>
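The augmentation code is not shown. A sketch under two assumptions: BFP follows the adult formula 1.39 * BMI + 0.16 * age - 10.34 * gender_class - 9, and gender_class is coded 0 for female and 1 for male. Neither is stated in the slides, but with the first four rows printed earlier (the stand-in demo below) they reproduce the first BFP and WHR values shown in the long-format table:

```r
library("tidyverse")

# Stand-in data: the first four rows printed earlier
demo <- tibble(age    = c(46, 29, 58, 67),
               gender = c("female", "female", "female", "male"),
               height = c(62, 64, 61, 67),
               weight = c(121, 218, 256, 119),
               waist  = c(29, 46, 49, 33),
               hip    = c(38, 48, 57, 38))

demo <- demo |>
  mutate(height_cm = round(height * 2.54, digits = 1),
         weight_kg = round(weight * 0.454, digits = 1),
         BMI = weight_kg / (height_cm / 100)^2,
         # Assumed coding: female = 0, male = 1
         gender_class = case_when(gender == "female" ~ 0,
                                  gender == "male" ~ 1),
         # Assumed adult body fat percentage formula
         BFP = 1.39 * BMI + 0.16 * age - 10.34 * gender_class - 9,
         # A ratio, so inches or cm give the same result
         WHR = waist / hip)
```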

Which correlates better with BMI: WHR or BFP?

# A tibble: 806 × 3
     BMI metric  value
   <dbl> <chr>   <dbl>
 1  22.1 WHR     0.763
 2  22.1 BFP    29.1  
 3  37.4 WHR     0.958
 4  37.4 BFP    47.7  
 5  48.4 WHR     0.860
 6  48.4 BFP    67.6  
 7  18.6 WHR     0.868
 8  18.6 BFP    17.3  
 9  27.9 WHR     1.07 
10  27.9 BFP    29.6  
# ℹ 796 more rows
# A tibble: 2 × 2
  metric   PCC
  <chr>  <dbl>
1 BFP    0.888
2 WHR    0.104
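The code behind these two tables is not shown. A minimal sketch, assuming the long format comes from pivot_longer() and the PCC (Pearson correlation coefficient) from cor() per metric. The names demo, long_data and pcc are hypothetical, and demo holds the rounded values of the first five subjects printed above:

```r
library("tidyverse")

# Stand-in data: rounded values of the first five subjects printed above
demo <- tibble(BMI = c(22.1, 37.4, 48.4, 18.6, 27.9),
               WHR = c(0.763, 0.958, 0.860, 0.868, 1.07),
               BFP = c(29.1, 47.7, 67.6, 17.3, 29.6))

# One row per subject and metric
long_data <- demo |>
  pivot_longer(cols = c(WHR, BFP),
               names_to = "metric",
               values_to = "value")

# Pearson correlation with BMI for each metric
# use = "complete.obs" is an assumption, needed if the full data has NAs
pcc <- long_data |>
  group_by(metric) |>
  summarise(PCC = cor(x = BMI, y = value, use = "complete.obs"))
```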

A Better Way?

# A tibble: 2 × 4
  metric   PCC    R2 label     
  <chr>  <dbl> <dbl> <chr>     
1 BFP    0.888 0.789 R2 = 0.789
2 WHR    0.104 0.011 R2 = 0.011
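The code behind this table is not shown. A minimal sketch of how the R2 and label columns could be added before plotting, assuming R2 is the squared PCC rounded to 3 decimals. The tibble pcc is a stand-in holding the PCC values printed earlier:

```r
library("tidyverse")

# Stand-in data: the PCC values printed earlier
pcc <- tibble(metric = c("BFP", "WHR"),
              PCC = c(0.888, 0.104))

# Prepare the plot annotation as data, so the plotting code only maps columns
pcc <- pcc |>
  mutate(R2 = round(PCC^2, digits = 3),
         label = str_c("R2 = ", R2))
```

The label column can then be passed to, for example, geom_text() without any string handling inside the plotting code.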

Which correlates better with BMI: WHR or BFP?

NB! “Tim Toady” (TIMTOWTDI: there is more than one way to do it). As much as possible, prepare your data to contain what you need before plotting!